Generalized linear models

So far we have used linear regressions, where we modeled the outcome as follows:

\[ y \sim Normal(\mu,sigma) \\ \mu = \alpha + \beta X \]

This regression works generally well, even if \(\small y\) is not normally distributed. (The residuals should be though. And a posterior predictive plot is always useful!)

Distributions

However, for some outcome types a linear regression simply does not work. This is particularity clear when the outcome is bound to be equal to or larger than zero (like counts that “clump” at zero) or when the outcome is binary.

As an example the next figure shows end of year math grades in a Portuguese school.1

df=read.table("data/student-mat.csv",sep=";",header=TRUE)
df = df[df$Medu>0,]
set_par()
x = barplot(c(table(df$G3),0,0,0), xaxt = "n", border = "grey")
axis(1,at = x[seq(0,20,2)+1], labels = seq(0,20,2))

In these situations we can use this kind of model:

\[ y \sim dist(\theta_1,\theta_2) \\ \]

Here \(\small dist(\theta_1,\theta_2)\) is a distribution with two parameters2 that is consistent with the observed data.

Choosing a different distribution means choosing a different likelihood function. That is, we exchange dnomr with an alternative distribution that is appropriate for the data.

Here are a few examples:

  • The Binomial distribution models the “number of successes in a sequence of n independent experiments, each asking a yes–no question”. Hence we use the Binomial distribution function to calculate the likelihood of the data given model and parameters when we model number of successes. A special case is when we have only one trial / experiment, the Binomial distribution models binary outcomes.
set_par()
x = seq(0,5,1)
plot(x-.2,dbinom(x, 5, .2),"h", xlim = c(0,20), lwd = 2,
     ylab = "probability mass", xlab = "number of success")
x = seq(0,10,1)
lines(x,dbinom(x, 10, .5),"h", col = "blue", lwd = 2)
x = seq(0,20,1)
lines(x+.2,dbinom(x, 20, .75),"h", col = "red", lwd = 2)
legend("topright",
       lwd = 2, col = c("black","blue","red"),
       legend = c(
         expression(theta[1]~" = n = 5, "~theta[2]~" = p = .2"),
         expression(theta[1]~" = n = 10, "~theta[2]~" = p = .5"),
         expression(theta[1]~" = n = 30, "~theta[2]~" = p = .75")
       ),
       bty = "n")

  • The Poisson distribution “expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate” . Hence we use the Poisson distribution function to calculate the likelihood of the data given model and parameters when we occurrence of events (counts).
set_par()
x = seq(0,5,1)
plot(x-.2,dpois(x, .5),"h", xlim = c(0,20), lwd = 2,
     ylab = "probability mass", xlab = "number of events")
x = seq(0,10,1)
lines(x,dpois(x, 5),"h", col = "blue", lwd = 2)
x = seq(0,20,1)
lines(x+.2,dpois(x, 10),"h", col = "red", lwd = 2)
legend("topright",
       lwd = 2, col = c("black","blue","red"),
       legend = c(
         expression(theta[1]~" = "~lambda~" = p =  .5"),
         expression(theta[1]~" = "~lambda~" = p = 5"),
         expression(theta[1]~" = "~lambda~" = p = 10")
       ),
       bty = "n")

Logistic regression

For a hands on example for logistic regression we will try to predict failing the math class with the Portugese school data shown above. In particular, we will use these predictors:

It is always a good idea to plot the data first:

data.list = list(
  Medu = df$Medu,
  failures = df$failures,
  fail = df$G3 == 0
) 
set_par(mfrow = c(1,2))
table(data.list$Medu) %>% barplot(border = "grey", ylab = "count", xlab = "maternal edu")
table(data.list$failures) %>% barplot(border = "grey",   xlab = "number fails")

We are de-meaning Medu, so that the intercept measures odds of average maternal education.

data.list$cMedu = data.list$Medu-2.5

We beging the analysis with an intercept model only:

model = alist(
  fail ~ dbinom(1,p),
  logit(p) <- a,
  a ~ dnorm(0,10)
)

Prior predictive check

And we estimate the model with ulam (i.e. Stan):

u.fit_I = ulam(
  model,
  data = data.list,
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan

We extract the prior and plot the prior prediction for the failure rate:

prior = extract.prior(u.fit_I)
set_par()
hist(inv_logit(prior$a), main="")

To see what is going on here, lets overlay the logistic function on the prior:

set_par(mfrow = c(1,2), mar=c(3,3,3,3), cex = 1)

curve(dnorm(x,0,10),-30,30, ylab = "prior density", xlab = "a")
axis(4,at = seq(0,0.04, length.out = 5),
     labels = seq(0,1,length.out = 5), col = "red")
curve(inv_logit(x)/25, add = T, col = "red")
abline(h = .1/25, lty = 3, col = "grey")
title("a ~ dnorm(0,10)")

curve(dnorm(x,0,2),-6,6, ylab = "prior density", xlab = "a")
axis(4,at = seq(0,0.2, length.out = 5),
     labels = seq(0,1,length.out = 5), col = "red")
curve(inv_logit(x)/5, add = T, col = "red")
abline(h = .1/5, lty = 3, col = "grey")
title("a ~ dnorm(0,2)")

With a wide prior, most of the probability mass is smaller than -5, leading to p very close to 0, or larger than 5, leading to p close to 1.

How about the regression coefficients \(\small beta\)? Regression coefficients are log odds ratios.

For instance, lest assume the following:

  • for children who never failed a class, 2 out of 100 children fail the class now
  • for children who once failed a class, 4 out of 100 children fail the class

then the odds ratio is :

\[ \begin{align*} OR = & \frac{\textrm{failure odds no prior fail}}{\textrm{failure odds prior fail}} \\ = & \frac{\frac{2}{98}}{\frac{4}{96}} = \frac{.0204}{.0417} = .49 \end{align*} \]

Note that this specific odd ratio comes close to the risk ratio of \(\small .2 / .4 = .5\).

This is, however, not generally the case. If the probabilities are far away from zero, risk ratio and odds ratio do not align! We can see this if we just multiply the probability of failure by 15 in both groups:

\[ \begin{align*} OR = \frac{\frac{30}{70}}{\frac{60}{40}} = \frac{.428}{.667} = .29 \end{align*} \]

Lets go back to specifying our prior for \(\small \beta\):

  • We had an odds ration of 0.49, which corresponds to a log odds ratio of log(.49) ~ -.7.
  • This means that if we assume that that failure probability is low and a shift from no to one prior fail comes with a doubling of class-fail probability, we should allow \(\small \beta\) values of size .7.
  • If we wanted to set an informative prior, which however does not prefer one direction of the effect, we could set a dnorm(0,.7) prior.
  • But we are not so sure and don’t like to informative priors, so we set a dnorm(0,2) prior.

Lest specify such a model and look at the prior predictions for the difference between two levels of education:

model = alist(
  fail ~ dbinom(1,p),
  logit(p) <- a + b1*failures,
  a ~ dnorm(0,2),
  b1 ~ dnorm(0,2)
)
u.fit_F = ulam(
  model,
  data = data.list,
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan

prior = extract.prior(u.fit_F)

Now we can use the link_ulam function and new data to generate predictions from the prior. First we look at the effect of a one level change of education.

p = link_ulam(
  u.fit_F, post = prior, 
  data = list(failures = c(0,1)))
set_par(mfrow = c(1,2), mar=c(3,3,3,3), cex = 1)
hist(p[,2]-p[,1],
     main = "risk difference", xlab = "P(fail|high) - P(fail|low)")
hist(p[,2]/p[,1], breaks = 30,
     main = "risk ratio", xlab = "P(fail|high) / P(fail|low)")

These differences and ratios are pretty large, so it is safe to make the prior a bit narrower and set the standard deviation for the regression weights to 1.

Here is the our model so far, which include previous class fails and maternal education as predictors:

model = alist(
  fail ~ dbinom(1,p),
  logit(p) <- a + b1*failures + b2*cMedu,
  a ~ dnorm(0,2),
  b1 ~ dnorm(0,1),
  b2 ~ dnorm(0,1)
)
u.fit_FE = ulam(
  model,
  data = data.list,
  log_lik = TRUE,   # for model comparison
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan

Some quick convergence diagnostics:

precis(u.fit_FE, depth = 2)
##          mean        sd       5.5%       94.5%    n_eff     Rhat4
## a  -2.5837038 0.2204297 -2.9542099 -2.24176580 2273.241 0.9992693
## b1  0.6900335 0.1770531  0.4095241  0.98004828 2358.207 1.0000829
## b2 -0.2939670 0.1704245 -0.5669423 -0.02034664 3005.834 1.0000001

Posterior predictive check

To see if our model describes the data well, we plot model predicted and observed data together:

aggr.fun = function(x) return(c(mean = mean(x), N = length(x)))
obs = 
  aggregate(
  data.list$fail,
  by = with(data.list, 
            data.frame(cMedu = cMedu, Medu = Medu, failures = failures)), 
  aggr.fun
  )
obs = cbind(obs,obs$x)

sim.data = 
  unique(as.data.frame(data.list)[,c(2,4)])

p = link_ulam(
  u.fit_FE,
  data = sim.data)

pp.stats = cbind(
  sim.data,
  m = colMeans(p),
  t(apply(p,2,PI))
)

set_par(mfrow = c(2,2), mar=c(3,3,1.5,0.5), cex = .75)
tmp = 
  lapply(unique(obs$cMedu), function(x) {
  plot(obs[obs$cMedu == x,"failures"],
       obs[obs$cMedu == x,"mean"],
       cex = sqrt(obs[obs$cMedu == x,"N"]),
       xlim = c(-.5,3.5), ylim = c(0,.5),
       ylab = "proportion fail",
       xlab = "# past failures",
       xaxt = "n", pch = 16, col = "grey")
    title(paste("maternal edu",x+2.5),line = 0.5, cex.main = 1)
    axis(1,at = 0:3)
    points(pp.stats[pp.stats$cMedu == x,"failures"],
       pp.stats[pp.stats$cMedu == x,"m"], pch = 16, col = "blue")
    arrows(pp.stats[pp.stats$cMedu == x,"failures"],
           y0 = pp.stats[pp.stats$cMedu == x,"5%"],
           y1 = pp.stats[pp.stats$cMedu == x,"94%"],
           length = 0, col = "blue")
})

This does not look great, we can try to improve the model in two ways:

  • give each level of maternal education its own intercept
  • give each level of maternal education its own slope
model = alist(
  fail ~ dbinom(1,p),
  logit(p) <- a[Medu] + b[Medu]*failures,
  a[Medu] ~ dnorm(0,2),
  b[Medu] ~ dnorm(0,1)
)
u.fit_FE.2 = ulam(
  model,
  data = data.list,
  log_lik = TRUE,   # for model comparison
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan

Did the chains converge?

precis(u.fit_FE.2)
## 8 vector or matrix parameters hidden. Use depth=2 to show them.
## [1] mean  sd    5.5%  94.5% n_eff Rhat4
## <0 rows> (or 0-length row.names)

This looks OK, so we do again a posterior predictive check:

sim.data = 
  unique(as.data.frame(data.list)[,c(1,2)]) 

p = link_ulam(
  u.fit_FE.2,
  data = sim.data)

pp.stats = cbind(
  sim.data,
  m = colMeans(p),
  t(apply(p,2,PI))
)

set_par(mfrow = c(2,2), mar=c(3,3,1.5,0.5), cex = .75)
tmp = 
  lapply(unique(obs$Medu), function(x) {
  plot(obs[obs$Medu == x,"failures"],
       obs[obs$Medu == x,"mean"],
       cex = sqrt(obs[obs$Medu == x,"N"]),
       xlim = c(-.5,3.5), ylim = c(0,.5),
       ylab = "proportion fail",
       xlab = "# past failures",
       xaxt = "n", pch = 16, col = "grey")
    title(paste("maternal edu",x),line = 0.5, cex.main = 1)
    axis(1,at = 0:3)
    points(pp.stats[pp.stats$Medu == x,"failures"],
       pp.stats[pp.stats$Medu == x,"m"], pch = 16, col = "blue")
    arrows(pp.stats[pp.stats$Medu == x,"failures"],
           y0 = pp.stats[pp.stats$Medu == x,"5%"],
           y1 = pp.stats[pp.stats$Medu == x,"94%"],
           length = 0, col = "blue")
})

This does not look that much better. Let’s check this with PSIS-LOO:

compare(u.fit_FE.2,u.fit_FE, func = "PSIS")
##                PSIS       SE    dPSIS      dSE    pPSIS     weight
## u.fit_FE   232.9903 25.47049 0.000000       NA 3.039043 0.97519255
## u.fit_FE.2 240.3333 25.64213 7.342982 3.236801 7.300440 0.02480745

In this case the added complexity has not helped much. So lets look at the parameters of the first model:

plot(precis(u.fit_FE, depth = 2))

summary = precis(u.fit_FE, depth = 2) %>% round(2)
summary
##     mean   sd  5.5% 94.5%   n_eff Rhat4
## a  -2.58 0.22 -2.95 -2.24 2273.24     1
## b1  0.69 0.18  0.41  0.98 2358.21     1
## b2 -0.29 0.17 -0.57 -0.02 3005.83     1

What does this all mean?

Because the parameters are log-odds, we exponentiate them to get OR. Because the probability of our outcome is (relatively) close to zero, we can interprete the OR as approximate risk ratios. Hence

  • the intercept value is -2.58, which gives a baseline risk of 0.075774
  • the regression weight failures is 0.69, which means that the odds ratio and approximate relative risk to fail the class if one has failed it once before is exp(0.69) 1.9937155.
  • the regression weight maternal education is -0.29, which means that the odds ratio and approximate relative risk to fail the class if one has failed it once before is exp(-0.29) = 0.7482636. Correspondingly, if maternal education is changed by two levels, the OR (~RR) is exp(2 * -0.29) = 0.5598984

Constrasts

The OR and relative risks that we have to deal with due to the link function make it hard to interpret the results. But if we create posterior predictions we can simply calculate contrasts.

Lets look at the data we used to generate posterior predictions:

sim.data = 
  unique(as.data.frame(u.fit_FE@data)[, c("cMedu","failures","Medu")]) 
sim.data = sim.data[order(sim.data$cMedu,sim.data$failures),]
rownames(sim.data) = NULL
sim.data
##    cMedu failures Medu
## 1   -1.5        0    1
## 2   -1.5        1    1
## 3   -1.5        2    1
## 4   -1.5        3    1
## 5   -0.5        0    2
## 6   -0.5        1    2
## 7   -0.5        2    2
## 8   -0.5        3    2
## 9    0.5        0    3
## 10   0.5        1    3
## 11   0.5        2    3
## 12   0.5        3    3
## 13   1.5        0    4
## 14   1.5        1    4
## 15   1.5        2    4

If we now generate the posterior predictions, we get a matrix in which each column corresponds to one row in sim.data

pp = link_ulam(
  u.fit_FE,
  data = sim.data)
dim(pp)
## [1] 4000   15

A simple question would be what the effect of having one vs zero previous failures is when maternal educational level is 1. We simply look up in the sim.data table that we need to compare columns one and two in the posterior predictions for this. In the same manner we can do this for maternal educational level 2.

delta.L1 = pp[,2] - pp[, 1]
delta.L2 = pp[,6] - pp[, 5]
xlim = range(c(delta.L1,delta.L2))
breaks = seq(xlim[1]-.01,xlim[2]+.01, length.out = 25)
set_par(mfrow = c(3,1))
hist(delta.L1, main = "", xlim = xlim, breaks = breaks)
abline(v = 0, col = "red", lty = 3)
hist(delta.L2, main = "", xlim = xlim, breaks = breaks)
abline(v = 0, col = "red", lty = 3)
hist(delta.L2-delta.L1, main = "")
abline(v = 0, col = "red", lty = 3)

The third histogram shows an interaction effect on the scale of risk differences, even though our model has no interaction term! This is due to the exponent in the link function!

If we for instance want to see the difference in fail probability between maternal education levels 2 and 3, we calculate one vector of samples with fail probabilities at level 1, one for level 4, and subtract them:

fail.level1 = rowMeans(pp[,1:4])
fail.level4 = rowMeans(pp[,13:15])
delta = fail.level4 - fail.level1
set_par()
hist(delta)

However, this analysis assumed that the pupils with different number of previously failed classes are equally frequent, which is not the case. So we need to weight with these probabilities:

M1_p.failures = prop.table(table(df$failures[df$Medu == 1]))
M4_p.failures = prop.table(table(df$failures[df$Medu == 4]))
fail.level1 = pp[,1:4] %*% M1_p.failures 
fail.level4 = pp[,13:15] %*% M4_p.failures 
delta = fail.level4 - fail.level1
set_par()
hist(delta)


  1. P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. Data can be downloaded here↩︎

  2. There are also distributions with 1 or 3 or more parameters↩︎